Recent research has reported a performance degradation in self-supervised contrastive learning for specially designed efficient networks, such as MobileNet and EfficientNet. A common practice to address this problem is to introduce a pretrained contrastive teacher model and train the lightweight networks with distillation signals generated by the teacher. However, it is time and resource consuming to pretrain a teacher model when it is not available. In this work, we aim to establish a stronger baseline for lightweight contrastive models without using a pretrained teacher model. Specifically, we show that the optimal recipe for efficient models is different from that of larger models, and using the same training settings as ResNet50, as previous research does, is inappropriate. Additionally, we observe a common issu e in contrastive learning where either the positive or negative views can be noisy, and propose a smoothed version of InfoNCE loss to alleviate this problem. As a result, we successfully improve the linear evaluation results from 36.3\% to 62.3\% for MobileNet-V3-Large and from 42.2\% to 65.8\% for EfficientNet-B0 on ImageNet, closing the accuracy gap to ResNet50 with $5\times$ fewer parameters. We hope our research will facilitate the usage of lightweight contrastive models.
translated by 谷歌翻译
Time series motif discovery has been a fundamental task to identify meaningful repeated patterns in time series. Recently, time series chains were introduced as an expansion of time series motifs to identify the continuous evolving patterns in time series data. Informally, a time series chain (TSC) is a temporally ordered set of time series subsequences, in which every subsequence is similar to the one that precedes it, but the last and the first can be arbitrarily dissimilar. TSCs are shown to be able to reveal latent continuous evolving trends in the time series, and identify precursors of unusual events in complex systems. Despite its promising interpretability, unfortunately, we have observed that existing TSC definitions lack the ability to accurately cover the evolving part of a time series: the discovered chains can be easily cut by noise and can include non-evolving patterns, making them impractical in real-world applications. Inspired by a recent work that tracks how the nearest neighbor of a time series subsequence changes over time, we introduce a new TSC definition which is much more robust to noise in the data, in the sense that they can better locate the evolving patterns while excluding the non-evolving ones. We further propose two new quality metrics to rank the discovered chains. With extensive empirical evaluations, we demonstrate that the proposed TSC definition is significantly more robust to noise than the state of the art, and the top ranked chains discovered can reveal meaningful regularities in a variety of real world datasets.
translated by 谷歌翻译
为了在盲图超级分辨率(SR)上取得有希望的结果,一些尝试利用低分辨率(LR)图像来预测内核并改善SR性能。但是,由于不可用的现实世界模糊内核,这些监督的内核预测(SKP)方法是不切实际的。尽管提出了一些无监督的降解预测(UDP)方法来绕过此问题,但\ textIt {contercestency}之间的降解嵌入和SR功能之间仍然具有挑战性。通过探索降解嵌入与SR功能之间的相关性,我们观察到共同学习内容和降解感知功能是最佳的。基于此观察结果,提出了一个名为CDSR的内容和退化的SR网络。具体而言,CDSR包含三个新建立的模块:(1)将基于重量的编码器(LPE)应用于共同提取内容和降解功能; (2)采用基于域查询的基于注意力的模块(DQA)来适应不一致; (3)基于密码的空格压缩模块(CSC),可以抑制冗余信息。对几个基准测试的广泛实验表明,即使与最先进的SKP方法相比,提议的CDSR的表现都优于现有的UDP模型,并在PSNR和SSIM上实现竞争性能。
translated by 谷歌翻译
我们建议采用统计回归作为投影操作员,以使数据驱动以数据为基础的Mori-Zwanzig形式主义中的运营商学习。我们提出了一种原则性方法,用于为任何回归模型提取Markov和内存操作员。我们表明,线性回归的选择导致了基于Mori的投影操作员最近提出的数据驱动的学习算法,这是一种高阶近似Koopman学习方法。我们表明,更具表现力的非线性回归模型自然填补了高度理想化和计算有效的MORI投影操作符和最佳迄今为止计算上最佳的Zwanzig投影仪之间的差距。我们进行了数值实验,并提取了一系列基于回归的投影的运算符,包括线性,多项式,样条和基于神经网络的回归,随着回归模型的复杂性的增加而显示出渐进的改进。我们的命题提供了一个通用框架来提取内存依赖性校正,并且可以轻松地应用于文献中固定动力学系统的一系列数据驱动的学习方法。
translated by 谷歌翻译
与此同时,黑匣子对抗攻击已经吸引了令人印象深刻的注意,在深度学习安全领域的实际应用,同时,由于无法访问目标模型的网络架构或内部权重,非常具有挑战性。基于假设:如果一个例子对多种型号保持过逆势,那么它更有可能将攻击能力转移到其他模型,基于集合的对抗攻击方法是高效的,用于黑匣子攻击。然而,集合攻击的方式相当不那么调查,并且现有的集合攻击只是均匀地融合所有型号的输出。在这项工作中,我们将迭代集合攻击视为随机梯度下降优化过程,其中不同模型上梯度的变化可能导致众多局部Optima差。为此,我们提出了一种新的攻击方法,称为随机方差减少了整体(SVRE)攻击,这可以降低集合模型的梯度方差,并充分利用集合攻击。标准想象数据集的经验结果表明,所提出的方法可以提高对抗性可转移性,并且优于现有的集合攻击显着。
translated by 谷歌翻译
不确定性遍及现代机器人自主堆栈,几乎每个组件(例如传感器,检测,分类,跟踪,行为预测)产生连续或离散的概率分布。尤其是,轨迹预测被不确定性所包围,因为其输入是由(嘈杂)上游感知产生的,并且其输出是通常用于下游计划中的概率的预测。但是,大多数轨迹预测方法并不能说明上游的不确定性,而仅考虑最明显的值。结果,感知不确定性不会通过预测传播,并且预测通常过于自信。为了解决这个问题,我们提出了一种在轨迹预测中纳入感知状态不确定性的新方法,其关键组成部分是一种新的基于统计距离的损失函数,它鼓励预测不确定性,以更好地匹配上游感知。我们在说明性模拟和大规模的现实世界数据中评估了我们的方法,证明了它在通过预测和产生更校准的预测来传播感知状态不确定性方面的功效。
translated by 谷歌翻译
Different people speak with diverse personalized speaking styles. Although existing one-shot talking head methods have made significant progress in lip sync, natural facial expressions, and stable head motions, they still cannot generate diverse speaking styles in the final talking head videos. To tackle this problem, we propose a one-shot style-controllable talking face generation framework. In a nutshell, we aim to attain a speaking style from an arbitrary reference speaking video and then drive the one-shot portrait to speak with the reference speaking style and another piece of audio. Specifically, we first develop a style encoder to extract dynamic facial motion patterns of a style reference video and then encode them into a style code. Afterward, we introduce a style-controllable decoder to synthesize stylized facial animations from the speech content and style code. In order to integrate the reference speaking style into generated videos, we design a style-aware adaptive transformer, which enables the encoded style code to adjust the weights of the feed-forward layers accordingly. Thanks to the style-aware adaptation mechanism, the reference speaking style can be better embedded into synthesized videos during decoding. Extensive experiments demonstrate that our method is capable of generating talking head videos with diverse speaking styles from only one portrait image and an audio clip while achieving authentic visual effects. Project Page: https://github.com/FuxiVirtualHuman/styletalk.
translated by 谷歌翻译
Through a study of multi-gas mixture datasets, we show that in multi-component spectral analysis, the number of functional or non-functional principal components required to retain the essential information is the same as the number of independent constituents in the mixture set. Due to the mutual in-dependency among different gas molecules, near one-to-one projection from the principal component to the mixture constituent can be established, leading to a significant simplification of spectral quantification. Further, with the knowledge of the molar extinction coefficients of each constituent, a complete principal component set can be extracted from the coefficients directly, and few to none training samples are required for the learning model. Compared to other approaches, the proposed methods provide fast and accurate spectral quantification solutions with a small memory size needed.
translated by 谷歌翻译
Body Mass Index (BMI), age, height and weight are important indicators of human health conditions, which can provide useful information for plenty of practical purposes, such as health care, monitoring and re-identification. Most existing methods of health indicator prediction mainly use front-view body or face images. These inputs are hard to be obtained in daily life and often lead to the lack of robustness for the models, considering their strict requirements on view and pose. In this paper, we propose to employ gait videos to predict health indicators, which are more prevalent in surveillance and home monitoring scenarios. However, the study of health indicator prediction from gait videos using deep learning was hindered due to the small amount of open-sourced data. To address this issue, we analyse the similarity and relationship between pose estimation and health indicator prediction tasks, and then propose a paradigm enabling deep learning for small health indicator datasets by pre-training on the pose estimation task. Furthermore, to better suit the health indicator prediction task, we bring forward Global-Local Aware aNd Centrosymmetric Encoder (GLANCE) module. It first extracts local and global features by progressive convolutions and then fuses multi-level features by a centrosymmetric double-path hourglass structure in two different ways. Experiments demonstrate that the proposed paradigm achieves state-of-the-art results for predicting health indicators on MoVi, and that the GLANCE module is also beneficial for pose estimation on 3DPW.
translated by 谷歌翻译
Text classification, a core component of task-oriented dialogue systems, attracts continuous research from both the research and industry community, and has resulted in tremendous progress. However, existing method does not consider the use of label information, which may weaken the performance of text classification systems in some token-aware scenarios. To address the problem, in this paper, we introduce the use of label information as label embedding for the task of text classification and achieve remarkable performance on benchmark dataset.
translated by 谷歌翻译